Crop Example

This example shows how to use the Crop object for disk-based combo running, either for persistent progress or for distributed processing.
First, let's define a very simple function, describe it with a Runner and a Harvester, and set the combos for this first set of runs.
In [1]:
import xyzpy as xyz
def foo(a, b):
    return a + b, a - b
r = xyz.Runner(foo, ['sum', 'diff'])
h = xyz.Harvester(r, data_name='foo_data.h5')
combos = {'a': range(0, 10),
          'b': range(0, 10)}
We could use the harvester to generate the data locally. But if we want the runs written to disk, either for persistence or so that they can be grown elsewhere, we need to create a Crop.
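(For comparison, the purely local route would be a single call; a minimal sketch, assuming the harvester's harvest_combos method runs and merges everything in the current process:)

    h.harvest_combos(combos)  # run all combos here and merge into foo_data.h5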
In [2]:
c = h.Crop(name='first_run', batchsize=5)
c
Out[2]:
In [3]:
c.sow_combos(combos)
There is now a hidden directory containing everything the crop needs:
In [4]:
ls -a
And inside that are folders for the batches and results, the pickled function, and some other dumped settings:
In [5]:
ls .xyz-first_run/
Once sown, we can check the progress of the Crop:
In [6]:
c
Out[6]:
There are a hundred combinations, with a batchsize of 5, yielding 20 batches to be processed.
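(Just to spell out that arithmetic, as a quick self-contained check:)

    import math

    num_combos = len(range(0, 10)) * len(range(0, 10))  # 10 * 10 = 100
    num_batches = math.ceil(num_combos / 5)              # -> 20 batches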
Any python process with access to the sown batches in .xyz-first_run (and the function's requirements) can grow the results (you could even zip the folder up and send it elsewhere). The process can be run in several ways:

- from the directory containing the .xyz-first_run folder itself, using e.g.:

      python -c "import xyzpy; xyzpy.grow(i)"  # with i = 1 ... 20

- by naming the crop explicitly:

      python -c "import xyzpy; crop = xyzpy.Crop(name='first_run'); xyzpy.grow(i, crop=crop)"

- from anywhere, by also supplying the crop's parent directory:

      python -c "import xyzpy; crop = xyzpy.Crop(name='first_run', parent_dir='.../xyzpy/docs/examples'); xyzpy.grow(i, crop=crop)"
To fake this happening, we can run grow ourselves (this cell could stand alone):
In [7]:
import xyzpy
crop = xyzpy.Crop(name='first_run')
for i in range(1, 11):
    xyzpy.grow(i, crop=crop)
And now we can check the progress:
In [8]:
print(c)
If we were on a qsub-based batch system, we could use Crop.qsub_grow to automatically submit all missing batches as jobs. It is worth double-checking the script that will be used first, though! This is done using Crop.gen_qsub_script:
In [9]:
print(c.gen_qsub_script(minutes=20, gigabytes=1))
The default scheduler is 'sge' (Sun Grid Engine), but you can also specify 'pbs' (Portable Batch System):
In [10]:
print(c.gen_qsub_script(minutes=20, gigabytes=1, scheduler='pbs'))
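If the generated script looks right, submission itself is then a single call; a sketch, assuming qsub_grow accepts the same resource options as gen_qsub_script:

    c.qsub_grow(minutes=20, gigabytes=1)  # submit all missing batches as jobs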
If you are just using the Crop as a persistence mechanism, then Crop.grow or Crop.grow_missing will process the batches in the current process:
In [11]:
c.grow_missing(parallel=True) # this accepts combo_runner kwargs
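Since combo_runner kwargs are forwarded, you could also, for instance, control the degree of parallelism; a sketch, with num_workers assumed to be among the accepted options:

    c.grow_missing(parallel=True, num_workers=4)  # hypothetical worker count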
In [12]:
print(c)
In [13]:
c.reap()
Out[13]:
The dataset foo_data.h5 should now be on disk, and the crop folder should have been cleaned up:
In [14]:
ls -a
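(If you ever want to open the saved file directly, rather than through the harvester, something like the following should work; a sketch, assuming xyzpy saved the .h5 file with the h5netcdf engine:)

    import xarray as xr

    ds = xr.open_dataset('foo_data.h5', engine='h5netcdf')  # engine is an assumption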
And we can inspect the results:
In [15]:
h.full_ds.xyz.iheatmap('a', 'b', 'diff')
Out[15]:
Many crops can be created from the harvester at once, and when they are reaped, the results should be seamlessly combined into the on-disk dataset.
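For example, one could split the parameter space across two crops and grow each wherever convenient; a hypothetical sketch reusing the harvester above (the crop names and the split are arbitrary):

    c1 = h.Crop(name='part_a', batchsize=5)
    c1.sow_combos({'a': range(0, 5), 'b': range(0, 10)})
    c2 = h.Crop(name='part_b', batchsize=5)
    c2.sow_combos({'a': range(5, 10), 'b': range(0, 10)})
    # ... grow both crops, locally or elsewhere ...
    c1.reap()  # each reap merges its results into foo_data.h5
    c2.reap()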
In [16]:
# for now clean up
h.delete_ds()